home
***
CD-ROM
|
disk
|
FTP
|
other
***
search
/
Shareware Grab Bag
/
Shareware Grab Bag.iso
/
090
/
cmln1285.arc
/
DADA.DOC
< prev
next >
Wrap
Text File
|
1986-02-27
|
16KB
|
319 lines
This file offers additional comments, notes and hints on the Dada
compiler given in DADA.PAS and described in Computer Language for
December, 1985.
1. Several of the types and variables that are defined globally in
DADA.PAS might better be declared as typed constants under Turbo
Pascal. They would then be local to the procedures that refer to
them and hence out of harm's way if some other procedure starts
meddling where it doesn't belong. They must be typed "constants"
rather than variables because they are static quantities: an error
message, for example, must retain its value between calls to the
error-listing routine. An added convenience of the Turbo typed
constants is that they can be initialized at the time they are
declared; thus the procedures InitKeywords and InitErrorList could
be discarded.
I avoided using typed constants because they are not a feature of
standard Pascal.
The following global variables (and their associated defined
types) might be made local in this way:
OutLine, OutPoint, OutBuf
CH
LineCount
TypeSet, TFset, RelOpSet, AddOpSet, MultOpSet
CurrentScope
Keywords
ErrorList
2. For simplicity I have defined the symbol table as a linked linear
list of invariant records. Several improvements could be consid-
ered. A binary tree structure would speed up the compiler; a hash
table would go even faster. The only routines that would have to
be changed are Find, Declare and Blot. It hardly seems worth the
trouble, however; Dada is a toy language, and the compiler is
unlikely to be given long programs, where speed of compilation
would become an issue.
If space rather than time is critical, a variant record could be
defined, with SymClass as the tag field. The VarType field is
needed only for variable entries and the Scope field only for
procedure entries. The savings, however, amount to only one byte
for VarType and two for Scope.
3. Doing a better job with errors is a lot of work. Stopping at the
first error is acceptable in a compiler that has a built-in
editor, where the position of the error can be indicated more or
less exactly on screen. In a batch-style compiler, the favored
strategy is to continue reading the source text, so that a single
compilation will in principle catch errors throughout the program.
The challenge is to recover from the first error well enough to
continue without issuing a Niagara of spurious error messages. The
Dada compiler is not capable of recovery. Removing the Halt state-
ment from the Error procedure will demonstrate this effectively.
4. The error-checking done here uses compiler directives peculiar to
Turbo Pascal. The main idea is that {$I-} allows the program to
continue after an input-output error, leaving an error code in the
system variable IoResult. If the operation is successful, IoResult
is zero. For other Pascal compilers equivalent services should be
available. A fuller and friendlier file-handling routine would
report the nature of the error, offer the user a second chance,
supply a default name for the output file, warn if an existing
file will be overwritten, etc.
A more convenient way of getting the file names would be to
retrieve them from the the command line. I decided against doing
so because the methods of accessing command-line arguments vary
between Pascal compilers and between operating systems.
5. The simplicity of the scanner owes much to the simplicity of the
language being recognized, but it is also made possible in part by
delaying certain lexical decisions until the syntactic or even the
semantic phase of program analysis. The scanner returns the same
code (Ident) for all identifiers, whether they refer to variables
or to procedures. The distinction between variable and procedure
names is really a lexical one, but it is not made until the parser
looks up the identifier in the symbol table.
Postponing this part of the lexical analysis allows the scanner to
operate without either lookahead of backtracking. At any moment,
the scanner knows only the current input symbol (the content of
variable CH), and once it has consumed that character it can never
go back to read it again.
In many compilers the scanner has broader responsibilities and is
consequently more complex. For example, if the scanner (rather
than the parser) installs identifiers in the symbol table, they
must be classified unambiguously at the outset. In Dada this might
be done by looking ahead, after finding an identifier, to see if
it is followed by an assignment operator, which would imply that
the identifier is (or ought to be) a variable name. If the
assignment operator were not found, the pointer to the input
stream would have to be backed up to the character immediately
following the identifier.
6. The organization of the symbol table and the three routines that
access the table are based on the most-closely-nested rule of
Pascal and other block-structured languages. Since the table is
searched from youngest to oldest, a call to Find(IdentifierName)
will always return a pointer to the most recent entry that matches
IdentifierName. This allows the compiler to distinguish between
identifiers that have the same name but differ in scope.
Suppose a Dada program makes the following declarations, causing
the parser to make the symbol-table calls shown. For each line the
value of CurrentScope is given in brackets:
procedure Proc1; [1] { Declare(Proc1) }
procedure Proc1A; [2] { Declare(Proc1A) }
begin [3]
end; [2] { Blot }
procedure Proc1B; [2] { Declare(Proc1B) }
procedure Proc1Ba; [3] { Declare(Proc1Ba) }
begin [4]
end; [3] { Blot }
begin [3]
end; [2] { Blot }
begin [2]
end; [1] { Blot }
procedure Proc2; [1] { Declare(Proc2) }
begin [2]
end; [1] { Blot }
The corresponding symbol-table entries would be built up by
Declare and subsequently eliminated by Blot in the following
order:
NAME SCOPE NEXT CURRENTSCOPE
FirstSym --> Proc1 1 ^<earlier> 1
Proc1 1 ^<earlier>
FirstSym --> Proc1A 2 ^Proc1 2
Proc1 1 ^<earlier>
Proc1A 2 ^Proc1
FirstSym --> Proc1B 2 ^Proc1A 2
Proc1 1 ^<earlier>
Proc1A 2 ^Proc1
Proc1B 2 ^Proc1A
FirstSym --> Proc1Ba 3 ^Proc1B 3
Proc1 1 ^<earlier>
Proc1A 2 ^Proc1
FirstSym --> Proc1B 2 ^Proc1A 2
Proc1 1 ^<earlier>
FirstSym --> Proc2 1 ^Proc1 1
The appending of a symbol to the table by Declare is straight-
forward. A new instance of the record structure is created, then
the Next field of the new record is made to point to the current
FirstSym, and finally FirstSym is adjusted to point to the new
record. Thus each new record is put at the "front" of the list.
The culling of old symbols by Blot is not quite as obvious. Each
name must be made inaccessible once the program passes beyond the
scope of that name. In the fragment above, for example, Proc2
should not be able to call Proc1A, Proc1B or Proc1Ba because those
procedures are local to Proc1. In Dada this is accomplished by
eradicating the symbol-table entry for a procedure once the "end"
that marks the limit of its scope is reached. The global variable
CurrentScope is incremented each time a block is entered; Blot
decrements CurrentScope and then works up the list, unlinking any
entries whose scope is numerically greater than CurrentScope.
Two further notes are needed. First, observe that "end" can appear
in two contexts in Dada (or Pascal): at the end of a block or at
the end of a compound statement. Blot must be called only on
exiting a block, not on completing a compound statement. Second,
the technique of making symbol-table entries inaccessible by
simply erasing them may not work in more elaborate compilers. If
there is an optimizing pass, for example, it will probably need to
consult the symbol table. The proper approach then is to include a
flag field in the record, so that a procedure like Blot can mark a
name as hidden without destroying the entry. This also allows
better error messages--"Variable referenced outside its scope"
rather than "Undeclared variable."
7. Forth systems vary a good deal, not only in the language they
accept but also in the file format they expect. The code generator
given here creates "screen" files made up of 1,024-byte blocks.
Each such block is conventionally viewed as a screen of 16 lines
with 64 characters per line. There are no end-of-line delimiters;
every character position that is not occupied by Forth text is
filled with an ASCII blank. Input in this form should be accept-
able to many Forth systems, provided they use the operating-system
disk format. The published routines were tested with versions of
PC/FORTH, from Laboratory Microsystems, Inc.
The plasticity and mutability of Forth itself raises other
problems. The various standards (fig-Forth, Forth-79, Forth-83)
differ appreciably, and there are differences even among systems
that are meant to conform to the same standard. I have tried to
confine myself to Forth words that are common to all systems, but
no doubt incompatibilities will turn up. The code generator can
produce the following Forth words:
( ) FORTH DEFINITIONS DECIMAL CONSTANT VARIABLE :
; ;S --> @ ! . CR MINUS 0= = < > DUP
DROP SWAP OVER KEY EMIT IF ELSE THEN BEGIN WHILE
REPEAT OR AND - * + / MOD
In addition, the following synonyms are defined in the header that
precedes every program:
1 CONSTANT TRUE
0 CONSTANT FALSE
: NEGATE MINUS ;
: NOT 0= ;
: <> = NOT ;
: >= < NOT ;
: <= > NOT ;
If any of this will make your Forth system choke, change it, or
add definitions to the header to provide compatibility.
8. The Dada compiler employs the parsing technique called recursive
descent, which directly reflects the recursive definition of the
language syntax. The program itself also illustrates this
structure: the routines are nested so that each one is accessible
only to the nest-higher-level routine that calls it. The overall
structure looks like this:
ParseProgram
ParseVariables
ParseBlock
ParseStatement
ParseExpression
ParseSimpleExpr
ParseTerm
ParseSignedFactor
ParseFactor
Thus to recognize a program the parser must find variable
declarations and a block; in recognizing a block it tries to
resolve the input stream into statements, and so on.
The recursive structure takes care of much of the bookkeeping that
would otherwise burden the program (and the programmer). This is
clearest in the parsing of expressions. In Forth as well as in
most machine languages expressions must be given in postfix
notation, whereas Dada and other "algebraic" languages use infix
notation. Making the conversion requires maintaining a stack of
pending operators. For example, the infix expression 3 * (4 + 5)
must be converted to the postfix 3 4 5 + *; to do so the operators
"+" and "*" must be stacked until both their operands have been
read from the input stream.
Examination of the parsing routines reveals no stack-management
apparatus. How is it done? The stack employed is the return stack
created by the storage-allocation subsystem of the Pascal com-
piler. Although HoldMultOp, for example, is defined as a single
variable, the executing program can actually create many instances
of it. Every time ParseTerm is called, a new instance of
HoldMultOp is pushed onto the return stack.
One final note on the parser: Just as the scanner can see only the
current character in the source file, the parser has access only
to the current token in the input stream. There is no explicit
lookahead or backtracking. Whereas the scanner really does its
analysis one symbol at a time, however, the parser saves a great
deal of information on its (hidden) stack. Indeed, the first
identifier in a program--the name of the program itself--is saved
until the main block at the very end of the program is reached.
9. The type checking done in DADA.PAS is not quite complete. The
operators are classified as RelOps, AddOps and MultOps according
to their precedence relations--multiplication must be done before
addition and addition before comparison. There is another dimen-
sion of classification, however, that the compiler ignores. Log-
ical OR has the same precedence as multiplication, and logical AND
is at the same level as addition, but the logical and arithmetic
operations properly apply to values of different types. As the
program stands, the compiler makes certain that the two operands
in each operation have the same type, but there is no assurance
the operator is appropriate. Thus "6 + FALSE" would generate an
error, but "6 AND 7" or "TRUE * FALSE" would pass without com-
plaint. Because Forth encodes boolean values as integers, such
expressions will also execute without causing a run-time error.
The remedy is straightforward, although it makes the code somewhat
bulkier. ParseSimpleExpr and ParseTerm must be made to analyze
logical and arithmetic expressions separately. Both operands of
AND and OR must be boolean values, and both operands of +, -, *
and / must be integers. In ParseExpression some decisions need to
be made. Boolean values can surely be compared for equality, but
is "TRUE > FALSE" a meaningful expression? All this becomes still
more complicated when additional data types, such as strings and
characters, are introduced.
10. There is a minor discrepancy between grammar and compiler in the
function ParseStatement. The grammar for Dada, following that for
Pascal, defines the semicolon as a statement separator, not a
statement terminator. Hence the proper syntax of a compound state-
ment is: begin
A := B;
B := C;
C := D
end
where there is no semicolon following the last simple statement
and preceding the "end". It's hard enough, however, remembering
where to put semicolons without also remembering special rules
about where not to put them. Many Pascal compilers therefore make
the final semicolon optional--unneeded but harmless if present.
The same policy is adopted in Dada.